The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable
service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases
comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity
(level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with
additional information from the patent literature and biological context. Corrections in patent publication numbers,
kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces
including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the
databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation
and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include
additional annotation for singleton clusters, identifier versioning for tracking entry changes and entry mappings
between the two-level databases.

Database URL: http://www.ebi.ac.uk/patentdata/nr/ 



Introduction 

The patent data are a valuable resource, not only for the
intellectual property world but also for the scientific
community (1,2). During the past 15 years, the number of
biological sequences appearing in patent documents has been
increasing constantly (3). Today, >30 million nucleotide
and protein sequences extracted from patent documents
are available in the public domain (shown by the black
lines in Figure 1). Searching this large amount of patent
sequence data has become one of the key approaches in
patent-related studies (4,5). Proprietary data also exist from
the commercial sector providing alternative annotations of
patent sequence data, such as GENESEQ™ (Thomson Reuters,
http://thomsonreuters.com/products_services/science/science_products/a-z/geneseq/),
GQ-PAT (GenomeQuest, http://wiki.genomequest.com/index.php/GQ_Pat)
and USGENE (SequenceBase,
http://www.sequencebase.com/usgene-sequences-database),
but these require commercial licenses, which impose usage
restrictions on the data.

The EMBL-European Bioinformatics Institute (EMBL-EBI)
offers free and unrestricted access to patent sequence
resources, providing a valuable service to the intellectual
property and bioscience communities (6). The two-level
non-redundant (NR) patent sequence databases, based on
sequence identity and patent family clusters, are
comprehensive repositories for patent information on nucleotide
and protein sequences provided by the European Patent
Office (EPO), the US Patent and Trademark Office
(USPTO), the Japanese Patent Office (JPO) and the Korean
Intellectual Property Office (KIPO) and include the World
Intellectual Property Organization (WIPO) patents from



© The Author(s) 2013. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http:// 
creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, 
provided the original work is properly cited.



Original article 



Database, Vol. 2013, Article ID bat005, doi:10.1093/database/bat005







Figure 1. Data growth of patent sequence data. The left-side Y-axis shows the number of sequence entries; the right-side Y-axis 
indicates the number of patents and patent families; the X-axis represents the release timeline. The black lines show the 
increasing number of source biological sequences; other coloured lines illustrate the trends of the NR patent sequence databases 
following the increase in source data. Note: The number of entries of level-2 clusters (NRNL2 and NRPL2) can decrease due to 
deletions and merging of patent family assignments and patent corrections, for example, in the cases of Release 10 (Oct 2011) 
and Release 13 (Oct 2012). 



these offices. The NR patent sequences are enriched with
biological annotation and additional data from patent
documents. These databases, serving as a repository of
scientific innovation and inspiration, are an important
resource for patent-related searches, especially for
determining potential commercial use of biological sequences and
their patentability. In this article, we describe the sequence
collection and annotation of the NR patent sequence
databases, and introduce improvements and development of
the databases over the past 3 years.

Sequence Collection and Annotation

The NR patent sequence data sources cover nucleotide and
protein sequences in patent applications from the EPO, the
JPO, the KIPO and the USPTO. The patent sequence data
deposited in ENA (7), GenBank (8) or the DDBJ (9) are
exchanged between these databases through the International
Nucleotide Sequence Database Collaboration (Figure 2a).
Inventors can submit sequences as part of the patent
application process using tools such as BISSAP
(http://www.epo.org/bissap/), an application developed by
the EPO in collaboration with national patent offices and
the EMBL-EBI to facilitate the creation of sequence listings
(WIPO ST.25 and proposed XML format) for patent
applications containing biological sequences.

The NR patent sequence databases have been created at
two levels to remove sequence redundancy by using sequence
MD5 (Message-Digest algorithm 5,
http://www.faqs.org/rfcs/rfc1321.html) checksums and patent
family information, comprising NR patent nucleotides level-1
and -2 and NR patent proteins level-1 and -2. Level-1 sequences
are 100% identical over their entire lengths, arising from
either the same or different patent families; level-2






[Figure 2 diagram. Labels recovered from the image: USPTO, EPO, JPO, KIPO, GenBank and OPS as sources; EPOPNR, JPOPNR, KPOPNR, USPOPNR, NRPL1, NRPL2 and NRNL2 as databases; Patent Equivalents; Patent Family Info & Corrections (PN/KC/Publication Level); Cluster Annotation; Feature Annotation; 2-Level ID Mappings; ID versioning; Search.]



Figure 2. Data flow for the NR patent sequence databases. (a)
Data sources consist of patent sequences from the patent
offices of the EPO, the JPO, the KIPO and the USPTO, as well as
the patent family data from the OPS. (b) Data collection and
annotation. The resulting databases include the sequence
clusters level-1 (NRNL1, NRPL1, EPOPNR, JPOPNR, KPOPNR and
USPOPNR) and level-2 (NRNL2 and NRPL2), the patent
equivalent database and other relevant result files. (c) Data access
through FTP, DbFetch, SRS, EBI-Search and SSS (Sequence
Similarity/Homology Search).



sequences are 100% identical over their entire length and
belong to the same patent family. Patent family information
for source sequences is retrieved from the EPO Open
Patent Services (OPS) (10). Level-1 databases include NR
nucleotide patent sequence clusters level-1 (NRNL1), NR
protein patent sequence clusters level-1 (NRPL1) and NR
protein patent sequence clusters from individual patent
offices (EPOPNR for the EPO, JPOPNR for the JPO, KPOPNR for
the KIPO and USPOPNR for the USPTO). Level-2 databases
contain NR nucleotide patent sequence clusters level-2
(NRNL2) and NR protein patent sequence clusters level-2
(NRPL2) (Figure 2b). The method used to remove sequence
redundancy is detailed in an article by Li et al. 2010 (6).
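The two-level grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline described by Li et al. (6): the function names, the record layout, the case and whitespace normalisation, and the rule that a member with an unknown family forms its own level-2 cluster are all hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_checksum(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence.

    Assumption: case and whitespace are ignored before hashing.
    """
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def cluster_two_levels(records):
    """Group (accession, sequence, family) records at two levels.

    Level-1 key: sequence checksum only.
    Level-2 key: sequence checksum plus patent family.
    `family` is None when the patent family is unknown.
    """
    level1 = defaultdict(list)
    level2 = defaultdict(list)
    for accession, sequence, family in records:
        checksum = md5_checksum(sequence)
        level1[checksum].append(accession)
        # Assumption: an unknown family never merges with another member.
        family_key = family if family is not None else ("unknown", accession)
        level2[(checksum, family_key)].append(accession)
    return dict(level1), dict(level2)


# Hypothetical records: three identical sequences in two families,
# plus one different sequence.
records = [
    ("A1", "ACGTACGT", "F1"),
    ("A2", "acgtacgt", "F1"),
    ("A3", "ACGT ACGT", "F2"),
    ("A4", "TTTTCCCC", "F1"),
]
level1, level2 = cluster_two_levels(records)
```

With these toy records, the three identical sequences share one level-1 cluster but split into two level-2 clusters because they belong to two families.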

The patent equivalents database was also developed to
provide patent family information extracted from the OPS
service for the sequences collected in this study (Figure 2b).
In patents, a right of priority is a time-limited right
triggered by the first filing of a patent application; a patent
family refers to several patent applications or publications
for an individual invention, claiming exactly the same
priority or priorities; all of these family equivalents are related
to each other by common priority numbers and associated
priority dates
(http://www.epo.org/searching/essentials/patent-families/about.html).
The family information in the database covers patent family
numbers, patent priority, master publications, patent
equivalents, subsequent publication levels and patent
classification. The database format is detailed in the user manual
(http://www.ebi.ac.uk/patentdata/doc/Family_equivalents_database_v3.pdf).

The annotation of the NR sequences comprises cluster
member annotation, patent family information and
biological features. The cluster member annotation includes
source sequence information, e.g. identifier (ID), molecular
type, sequence length, source database, patent number
and a general description. The patent family information
consists of family number, master publication, patent
priority, earliest publication date and the EPO and international
classifications. The earliest publication date is determined
by comparing the patent publication dates of all the
members of an NR sequence cluster, and helps to identify
relevant prior art. The biological features contain
information on organisms, coding sequence regions, genes and
variations, combined for both contig and singleton members.
This combined annotation allows better exploration of the
original patent applications for related intellectual
property data. It also provides better cross-references to
external data resources and improves the biological context at
the sequence level. The annotation format is detailed in the
user manual
(http://www.ebi.ac.uk/patentdata/doc/Non-redundant_databases-user_manual_v3.pdf).
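As a minimal sketch of how the earliest publication date of a cluster could be derived, the snippet below takes the minimum over the publication dates of all members. The dictionary layout, the field name `publication_date` and the example dates are hypothetical and do not reflect the actual flat-file format.

```python
from datetime import date


def earliest_publication_date(members):
    """Return the earliest publication date among cluster members.

    Members without a known date are skipped; returns None if no
    member has a date.
    """
    dates = [m["publication_date"] for m in members if m.get("publication_date")]
    return min(dates) if dates else None


# Hypothetical cluster with three members, one lacking a date.
cluster = [
    {"id": "X1", "publication_date": date(2003, 5, 14)},
    {"id": "X2", "publication_date": date(2001, 11, 2)},
    {"id": "X3", "publication_date": None},
]
earliest = earliest_publication_date(cluster)
```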

Data Growth and Improvements 

The NR patent sequence databases are released every 3 or 4
months, usually following EMBL-Bank's quarterly release
cycle. The current release (Release 13, Oct 2012) contains
12 279 969 NRNL1, 14 920 929 NRNL2, 2 580 442 NRPL1
and 3 697 317 NRPL2 entries, which is ~2.4-, 2.2-, 1.9- and
1.6-fold the size of the first release of NRNL1, NRNL2, NRPL1
and NRPL2, respectively. These entries cover 6 571 318 proteins
and 24 364 832 nucleotides from 184 447 patents (130 538
unique patent families), which are provided by the patent
offices of the EPO, the JPO, the KIPO and the USPTO
(Table 1, Figure 1). The data coverage is slightly larger than
that of the commercial patent sequence database GENESEQ, which
included >27 million sequences from >150 000 patents in
Oct 2012
(http://thomsonreuters.com/products_services/science/science_products/a-z/geneseq/).

Patent publication numbers, sequence kind-codes and 
patent equivalents are corrected or updated in each release 
Table 1. Summary of the NR patent sequences and the patent
families in Release 13

                          Number of entries   Redundancy before
Patent nucleotides               24 364 832
  NRNL1                          12 279 969                1.98
  NRNL2                          14 920 929                2.22
Patent proteins                   6 571 318
  NRPL1                           2 580 442                1.88
  NRPL2                           3 697 317                1.62
Patents                             184 447
Unique patent families              130 538                1.41



using the latest patent family data from the EPO's OPS.
Across all releases, 43 111 patent numbers and 14 330
sequence kind-codes have been corrected; 102 227 patent
numbers have been involved in the patent family
assignment. The corrected publication numbers link to the correct
full-text patent documents; the corrected publication kind
codes and the publication levels indicate the legal status
and progress through the patent application process.

The ID mappings between level-1 and level-2 databases
have been generated since Release 10 to clearly illustrate
how identical sequences from level-1 databases are
mapped to level-2 database entries according to their
patent family information. Figure 3 shows two examples
of how level-1 nucleotide and protein sequences are
clustered into level-2 entries. These mappings explain the
relationship between identical sequences within or outside
of a patent family.
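The nucleotide example in the Figure 3 caption can be written down as a small lookup structure. The sketch below uses the entry IDs, member accessions and family numbers quoted in that caption; the tuple layout and the function name are illustrative assumptions and do not represent the format of the distributed mapping files.

```python
from collections import defaultdict

# Level-1 entry and its members, as quoted in the Figure 3(a) caption.
LEVEL1_ID = "NRN_AX241249"
MEMBERS = [
    # (member accession, level-2 entry, patent family number or None)
    ("AX241249", "NRN00208E35", "22673211"),
    ("DJ381174", "NRN00208E35", "22673211"),
    ("AX487735", "NRN00208E36", "27401191"),
    ("AR579342", "NRN00208E37", "32911719"),
    ("DI090734", "NRN00208E38", None),  # family number unknown
]


def build_mappings(level1_id, members):
    """Build level-1 -> level-2 and level-2 -> member lookups."""
    level1_to_level2 = defaultdict(set)
    level2_to_members = defaultdict(list)
    for accession, level2_id, _family in members:
        level1_to_level2[level1_id].add(level2_id)
        level2_to_members[level2_id].append(accession)
    return dict(level1_to_level2), dict(level2_to_members)


level1_to_level2, level2_to_members = build_mappings(LEVEL1_ID, MEMBERS)
```

Here the single level-1 entry maps to four level-2 entries, and only the level-2 entry for family 22673211 holds more than one member.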

Members of level-2 clusters in an old release can move to
other clusters in a new release because of changes in the
assignment of equivalents to patent families. ID versioning
has been provided since Release 6 for direct tracking of
entry history. This functionality is necessary for recovering
information from old entries that have moved or have
become obsolete in a new release.
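To illustrate the kind of change that entry tracking has to capture, the following sketch compares member-to-cluster assignments from two releases and reports members that moved, disappeared or were added. It is a hypothetical example: the mapping layout and the identifiers are invented, and it does not reproduce the format of the ID history tables.

```python
def diff_releases(old, new):
    """Compare two {member accession: level-2 cluster ID} mappings."""
    shared = old.keys() & new.keys()
    moved = {m: (old[m], new[m]) for m in shared if old[m] != new[m]}
    deleted = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    return {"moved": moved, "deleted": deleted, "added": added}


# Hypothetical assignments in two consecutive releases.
old_release = {"M1": "C1", "M2": "C1", "M3": "C2"}
new_release = {"M1": "C1", "M2": "C3", "M4": "C2"}
changes = diff_releases(old_release, new_release)
```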

Data Access and Usage 

Data access to the NR patent sequence resources has
become more and more important to the user community
as the volume of sequence data increases. The NR patent
sequence databases can be accessed through four major
routes (Figure 2c) at EMBL-EBI:

(1) The flat files can be downloaded through the
databases website (http://www.ebi.ac.uk/patentdata/nr/)
and through the FTP site
(ftp://ftp.ebi.ac.uk/pub/databases/patentdata/).

(2) The EMBL-like formatted annotation data can be
retrieved on a per-accession basis through the Dbfetch/
WSDbfetch service
(http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/) and also
through the SRS server (http://srs.ebi.ac.uk/).

(3) Sequence similarity/homology searches including
FASTA (11), BLAST (12,13) and PSI-Search (14) against
the databases are available through the web form
submissions (http://www.ebi.ac.uk/Tools/sss/) and also
through the corresponding EMBL-EBI SOAP/REST
web services (15).

(4) Keyword searches can be made using the EBI-Search
engine (16) through both a web form
(http://www.ebi.ac.uk/ebisearch) and the corresponding SOAP
web services.
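For route (2), a per-accession request can be expressed as a URL. The sketch below only builds such a URL and makes no network call. The query parameters `db`, `id`, `format` and `style` follow the general Dbfetch URL syntax; the database name `nrnl1` is an assumption, and whether the service currently serves these databases has not been checked.

```python
from urllib.parse import urlencode

# Base address taken from the Dbfetch URL quoted in the list above.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, fmt="default", style="raw"):
    """Build a Dbfetch request URL for a single entry."""
    query = urlencode({"db": db, "id": entry_id, "format": fmt, "style": style})
    return f"{DBFETCH_BASE}?{query}"


# "nrnl1" is a hypothetical database name; the entry ID is the
# level-1 example from the Figure 3 caption.
url = dbfetch_url("nrnl1", "NRN_AX241249")
```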

Approximately 10 000 sequence similarity/homology
searches were performed using the databases during
2010. This grew to >36 000 searches in 2011, and it is
estimated that ~37 500 searches will have taken place during
2012. The same trend is seen for data retrieval via Dbfetch/
WSDbfetch, which has grown from 450 000 retrievals in 2011
to a projected 510 000 for 2012. FTP downloads of these
sequence data have also grown from 394 downloads in
2011 to a projected 540 for 2012.

Discussion and Future Implementation

The NR patent sequence databases are the first publicly
available collection of NR patent sequences, at both the
sequence and patent-family levels. Other efforts in the
public domain have been made to collate NR patent
sequence data to improve access and use of these data,
such as PatGen (17) and Patome (18). Unfortunately,
PatGen is no longer available online; the sequence
redundancy in Patome was defined according to the patent
number and the sequence ID in the sequence listing. As a
result, identical sequences granted with different patent
numbers by different patent offices are not classified.

Sequence similarity/homology searching against the NR
patent sequence databases has become a fundamental
approach in patent-related studies. Searches against NR
sequences are faster and more sensitive than the equivalent
searches against redundant libraries, and the search results
are easier to interpret. Searches against level-1 clusters
return identical or similar patent sequences; searches
against level-2 clusters return identical or similar
sequences from the same invention. These searches can be
used to find the published patents that cite a sequence
and the patent families associated with a sequence, to
discover the earliest priority data and the equivalents of a
patent family, and to retrieve biological annotation
extracted from patent documents.
Figure 3. Two example entries illustrating the mapping between identical sequences from level-1 to level-2. (a) The NRNL1 entry
NRN_AX241249 contains five member sequences, which are 100% identical over their full length but clustered into four NRNL2
entries according to their patent family information: NRN00208E35 (family number 22673211, containing the member sequences
AX241249 and DJ381174), NRN00208E36 (family number 27401191, containing the member sequence AX487735), NRN00208E37
(family number 32911719, containing the member sequence AR579342) and NRN00208E38 (DI090734 as member sequence and
family number unknown). (b) The NRPL1 entry NRP_AX240833 contains four member sequences that are clustered into three
NRPL2 entries according to their patent family information.



The NR patent sequence databases are an important
resource for patent-related searches, especially for
determining potential commercial use of biological sequences. The
earliest publication dates offer direct tracking of
patent-application history, enabling effective searches on prior art.
The corrections on the publication numbers and kind codes
enhance the data quality, enabling proper cross-referencing
to full-text patent documents. These databases are also a
repository of scientific innovation and inspiration.

We will continue to make improvements and add new
features in the future. For example, we plan to broaden data
coverage by including data from other national and regional
patent offices, to shorten the release cycle to a monthly
schedule and to integrate cross-references to claimed
sequences and provide claimed status. Currently, users can
download the ID history tables to track entry changes, such
as status, and entry additions, deletions, merging and
unmerging; in the future, an online searchable system will
be implemented.